
Improve performance of extract operations #172

Closed
simonw opened this issue Sep 23, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

simonw commented Sep 23, 2020

This command took about 12 minutes (against a 150MB file with 680,000 rows in it):

sqlite-utils extract salaries.db salaries \
   'Organization Group Code' 'Organization Group' \
  --table 'organization_groups' \
  --fk-column 'organization_group_id' \
  --rename 'Organization Group Code' code \
  --rename 'Organization Group' name

I'm pretty confident we can do better than that.

@simonw simonw added the enhancement New feature or request label Sep 23, 2020

simonw commented Sep 23, 2020

Steps to produce that database:

curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'
sqlite-utils insert salaries.db salaries salaries.csv --csv


simonw commented Sep 23, 2020

Here's the loop that's taking the time:

for row in rows_iter:
    row_pks = tuple(row[pk] for pk in pks)
    lookups = {rename.get(column) or column: row[column] for column in columns}
    self.update(row_pks, {first_column: lookup_table.lookup(lookups)})
    if progress:
        progress(1)


simonw commented Sep 23, 2020

lookup_table.lookup(lookups) is doing a SQL lookup. This could be cached in-memory, maybe with an LRU cache, to avoid looking up the primary key for value combinations we have recently used.
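A rough sketch of what that in-memory cache could look like (`CachedLookup` is a hypothetical name, not sqlite-utils API; a real version would wrap `lookup_table.lookup`):

```python
# Hypothetical sketch: memoize lookup-table primary keys in memory so
# repeated value combinations skip the SQL round trip. Names here are
# illustrative only; a real version would wrap lookup_table.lookup().
class CachedLookup:
    def __init__(self, lookup_fn):
        self.lookup_fn = lookup_fn  # the function that does the SQL lookup
        self.cache = {}

    def lookup(self, lookups):
        # dicts aren't hashable, so key on a sorted tuple of the items
        key = tuple(sorted(lookups.items()))
        if key not in self.cache:
            self.cache[key] = self.lookup_fn(lookups)
        return self.cache[key]
```

For the salaries table, where a handful of distinct organization groups repeat across 680,000 rows, the hit rate would be very high; a bounded cache (e.g. functools.lru_cache) would cap memory if the distinct values were unbounded.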

The .update() method it calls first does a get() and then runs a SQL UPDATE ... WHERE:

def update(self, pk_values, updates=None, alter=False, conversions=None):
    updates = updates or {}
    conversions = conversions or {}
    if not isinstance(pk_values, (list, tuple)):
        pk_values = [pk_values]
    # Soundness check that the record exists (raises error if not):
    self.get(pk_values)
    if not updates:
        return self
    args = []
    sets = []
    wheres = []
    validate_column_names(updates.keys())
    for key, value in updates.items():
        sets.append("[{}] = {}".format(key, conversions.get(key, "?")))
        args.append(value)
    wheres = ["[{}] = ?".format(pk_name) for pk_name in self.pks]
    args.extend(pk_values)
    sql = "update [{table}] set {sets} where {wheres}".format(
        table=self.name, sets=", ".join(sets), wheres=" and ".join(wheres)
    )

Batching those updates may have an effect. Or finding a way to skip the .get(), since we already know the record is valid.
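A hedged sketch of what batching could look like using sqlite3 directly (`batched_update` is a hypothetical helper, not sqlite-utils API): collect (new value, pk) pairs and apply them with a single executemany() in one transaction, skipping the per-row .get():

```python
import sqlite3

def batched_update(conn, table, column, pk, pairs):
    # One prepared UPDATE, executed for every (new_value, pk) pair in a
    # single transaction instead of one statement + commit per row.
    sql = "update [{}] set [{}] = ? where [{}] = ?".format(table, column, pk)
    with conn:
        conn.executemany(sql, pairs)

# Toy demonstration against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("create table salaries (id integer primary key, group_id integer)")
conn.executemany("insert into salaries (id) values (?)", [(1,), (2,), (3,)])
batched_update(conn, "salaries", "group_id", "id", [(10, 1), (20, 2), (30, 3)])
```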


simonw commented Sep 23, 2020

I ran sudo py-spy top -p 123 against the process while it was running and the most time is definitely spent in .update():

Total Samples 1000
GIL: 0.00%, Active: 90.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                                  
 38.00%  38.00%    3.85s     3.85s   update (sqlite_utils/db.py:1283)
 27.00%  27.00%    2.12s     2.12s   execute (sqlite_utils/db.py:161)
 10.00%  10.00%   0.890s    0.890s   execute (sqlite_utils/db.py:163)
 10.00%  17.00%   0.870s     1.54s   columns (sqlite_utils/db.py:553)
  0.00%   0.00%   0.110s    0.210s   <listcomp> (sqlite_utils/db.py:554)
  0.00%   3.00%   0.100s    0.320s   table_names (sqlite_utils/db.py:191)
  0.00%   0.00%   0.100s    0.100s   __new__ (<string>:1)


simonw commented Sep 23, 2020

I wonder if I could make this faster by separating it out into a few steps:

  • Create the new lookup table with all of the distinct rows
  • Add the blank foreign key column
  • Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
  • Drop the value columns
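
Those four steps could be sketched as plain SQL against a toy version of the salaries table (table and column names below are illustrative, not sqlite-utils output):

```python
import sqlite3

# Illustrative sketch of the multi-step approach, not sqlite-utils code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table salaries (id integer primary key, [Organization Group] text, pay real);
insert into salaries values
    (1, 'Public Works', 50000), (2, 'Public Works', 60000), (3, 'Health', 55000);

-- Step 1: lookup table of the distinct values
create table organization_groups (id integer primary key, name text);
insert into organization_groups (name)
    select distinct [Organization Group] from salaries;

-- Step 2: blank foreign key column
alter table salaries add column organization_group_id integer
    references organization_groups(id);

-- Step 3: one UPDATE with a correlated subquery, instead of one per row
update salaries set organization_group_id = (
    select id from organization_groups
    where name = salaries.[Organization Group]
);
""")
# Step 4 (dropping the value column) needs a table rebuild in SQLite
# versions before 3.35, which added ALTER TABLE ... DROP COLUMN.
```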


simonw commented Sep 23, 2020

The problem with this approach is that it's not compatible with progress bars - but if it's several times faster it's worth it.


simonw commented Sep 23, 2020

Also, what would happen if new rows were added to the table while that command was running?


simonw commented Sep 23, 2020

There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.

simonw added a commit that referenced this issue Sep 23, 2020
Refs #172 - seems to give me about 20% speedup.

simonw commented Sep 24, 2020

I wonder if I could make this faster by separating it out into a few steps:

  • Create the new lookup table with all of the distinct rows
  • Add the blank foreign key column
  • Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
  • Drop the value columns

My prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it!

% date
sqlite-utils extract salaries.db salaries \
   'Department Code' 'Department' \
  --table 'departments' \
  --fk-column 'department_id' \
  --rename 'Department Code' code \
  --rename 'Department' name
date
sqlite-utils extract salaries.db salaries \
   'Union Code' 'Union' \
  --table 'unions' \
  --fk-column 'union_id' \
  --rename 'Union Code' code \
  --rename 'Union' name
date
sqlite-utils extract salaries.db salaries \
   'Job Family Code' 'Job Family' \
  --table 'job_families' \
  --fk-column 'job_family_id' \
  --rename 'Job Family Code' code \
  --rename 'Job Family' name
date
sqlite-utils extract salaries.db salaries \
   'Job Code' 'Job' \
  --table 'jobs' \
  --fk-column 'job_id' \
  --rename 'Job Code' code \
  --rename 'Job' name
date
Thu Sep 24 00:48:16 PDT 2020

Thu Sep 24 00:48:20 PDT 2020

Thu Sep 24 00:48:24 PDT 2020

Thu Sep 24 00:48:28 PDT 2020

Thu Sep 24 00:48:32 PDT 2020

simonw added a commit that referenced this issue Sep 24, 2020
Takes my test down from ten minutes to four seconds.
@simonw simonw closed this as completed in 022cdd9 Sep 24, 2020
simonw added a commit that referenced this issue Sep 24, 2020